A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching
Entity Matching (EM) is a core data cleaning task, aiming to identify
different mentions of the same real-world entity. Active learning is one way to
address the challenge of scarce labeled data in practice: it dynamically
selects the examples to be labeled by an oracle and refines the learned
model (classifier) on them. In this paper, we build a unified active
learning benchmark framework for EM that allows users to easily combine
different learning algorithms with applicable example selection algorithms. The
goal of the framework is to provide practitioners with concrete guidelines on
which active learning combinations work well for EM. Towards this, we
perform comprehensive experiments on publicly available EM datasets from
product and publication domains to evaluate active learning methods, using a
variety of metrics including EM quality, the number of labels, and example selection
latencies. Our most surprising result is that active learning with fewer
labels can learn a classifier of quality comparable to supervised learning. In
fact, for several of the datasets, we show that there is an active learning
combination that beats the state-of-the-art supervised learning result. Our
framework also includes novel optimizations that improve the quality of the
learned model by roughly 9% in terms of F1-score and reduce example selection
latencies by up to 10x without affecting the quality of the model.
Comment: accepted for publication in ACM SIGMOD 2020, 15 pages
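
The loop the abstract describes can be made concrete with a small sketch. The code below is a minimal, hypothetical illustration of one learner/selector combination (a logistic regression classifier with uncertainty-based example selection), not the paper's benchmark framework; the function names, the featurized candidate-pair input, and the labeling budget are all assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_em(X_pool, oracle, X_seed, y_seed, budget=200, batch_size=10):
    """Pool-based active learning for entity matching (illustrative sketch).
    X_pool  : feature vectors of unlabeled candidate record pairs
    oracle  : callable mapping a pool index to a 0/1 match label (a human in practice)
    X_seed, y_seed : small labeled seed set containing both matches and non-matches
    """
    X_lab, y_lab = list(X_seed), list(y_seed)
    unlabeled = set(range(len(X_pool)))
    model = LogisticRegression(max_iter=1000)

    while budget > 0 and unlabeled:
        model.fit(np.array(X_lab), np.array(y_lab))
        idx = np.fromiter(unlabeled, dtype=int)
        match_prob = model.predict_proba(X_pool[idx])[:, 1]
        # Uncertainty sampling: pick the candidate pairs the model is least sure about.
        order = np.argsort(np.abs(match_prob - 0.5))
        picked = idx[order[:min(batch_size, budget)]]
        for i in picked:
            X_lab.append(X_pool[i])
            y_lab.append(oracle(int(i)))   # ask the oracle for a label
            unlabeled.discard(int(i))
        budget -= len(picked)

    model.fit(np.array(X_lab), np.array(y_lab))  # final refit on all collected labels
    return model

Swapping the selector (e.g., query-by-committee) or the learner (e.g., a tree ensemble) would yield the kind of alternative combinations such a framework is meant to benchmark.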
Complaint-driven Training Data Debugging for Query 2.0
As the need for machine learning (ML) increases rapidly across all industry
sectors, there is significant interest among commercial database providers in
supporting "Query 2.0", which integrates model inference into SQL queries.
Debugging Query 2.0 is very challenging since an unexpected query result may be
caused by bugs in the training data (e.g., wrong labels, corrupted features).
In response, we propose Rain, a complaint-driven training data debugging
system. Rain allows users to specify complaints over the query's intermediate
or final output, and aims to return a minimum set of training examples so that
if they were removed, the complaints would be resolved. To the best of our
knowledge, we are the first to study this problem. A naive solution requires
retraining an exponential number of ML models. We propose two novel heuristic
approaches based on influence functions, both of which require only a linear
number of retraining steps. We provide an in-depth analytical and empirical analysis of the two
approaches and conduct extensive experiments to evaluate their effectiveness
using four real-world datasets. Results show that Rain achieves the highest
recall@k among all the baselines while still returning results interactively.
Comment: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
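
The central technical step the abstract alludes to, i.e., using influence functions to avoid retraining one model per candidate training example, can be sketched for a simple model. The code below is an illustrative approximation for an L2-regularized logistic regression, not Rain's algorithm: the complaint is simplified to a single prediction the user believes should have a different label, and the function names and regularization constant are assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rank_examples_for_complaint(X, y, theta, x_c, y_c, reg=1e-2, k=10):
    """Rank training examples by how much removing each one is predicted to
    reduce the loss on a complaint point (x_c, y_c), where y_c is the label
    the user says the model should have produced. Illustrative sketch only."""
    n, d = X.shape
    p = sigmoid(X @ theta)

    # Per-example gradients of the logistic loss: (p_i - y_i) * x_i
    grads = (p - y)[:, None] * X

    # Hessian of the regularized empirical risk at theta
    H = (X.T * (p * (1.0 - p))) @ X / n + reg * np.eye(d)

    # Gradient of the loss on the complaint point under the desired label
    g_c = (sigmoid(x_c @ theta) - y_c) * x_c

    # Influence-function estimate (Koh & Liang style) of the change in the
    # complaint loss if example i were removed: delta_i ~ grad_i^T H^{-1} g_c / n
    delta = (grads @ np.linalg.solve(H, g_c)) / n

    # The most negative estimates mark the examples whose removal is predicted
    # to resolve the complaint the most; return the indices of the top-k.
    return np.argsort(delta)[:k]

Because one Hessian solve is reused for all candidate training examples, this estimate avoids retraining a model for every candidate subset, which is the intuition behind keeping the number of retraining steps linear rather than exponential.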